As the workflow for ATAC-seq QC was established in step 2, I performed this workflow on the entire data set. However, performing pseudo-bulking and normalization and subsequent visualization with the data from all cells is very computationally intensive. Therefore, pseudo-bulking and manual ln(1+x) normalization was performed separately from the visualization here.

Libraries

These are the packages required for the code below.

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(binom)
library(broom)
## Warning: Paket 'broom' wurde unter R Version 4.4.3 erstellt
library(Seurat) #for rna-seq data, atac-seq data
## Lade nötiges Paket: SeuratObject
## Warning: Paket 'SeuratObject' wurde unter R Version 4.4.3 erstellt
## Lade nötiges Paket: sp
## 
## Attache Paket: 'SeuratObject'
## 
## Die folgenden Objekte sind maskiert von 'package:base':
## 
##     intersect, t
library(Signac) #for atac seq data

1. Data loading and inspecting

First, I loaded these data sets into R to work with them. The data sets were graciously extracted and compiled by Lauren Saunders from Lin et al. 2023.

readRDS("Data/Lin_2023_zfish_snATAC-seq/Cell-type_Peak_Matrix.rds") -> zfish_snATAC_seq_pk_mtrx #count matrix
read_csv("Data/Lin_2023_zfish_snATAC-seq/atac_all.metaData.csv")-> zfish_mta_data # meta data
## New names:
## Rows: 50637 Columns: 24
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): ...1, Sample, stage, Clusters, celltype, predictedCell, predictedG... dbl
## (17): TSSEnrichment, ReadsInTSS, ReadsInPromoter, ReadsInBlacklist, Prom...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Then, I inspected the data sets with class() and dim(). The matrix zfish_snATAC_seq_pk_mtrx contains the peaks as rownames and columns as cell names and counts as entries. As this is a sparse matrix, the 0s are written as dots. In addition, the data frame zfish_mta_data contains the meta data.

unique(zfish_mta_data$Clusters) #how many clusters in total
##  [1] "C3"  "C1"  "C24" "C5"  "C4"  "C2"  "C18" "C7"  "C8"  "C16" "C25" "C14"
## [13] "C21" "C19" "C12" "C17" "C15" "C10" "C9"  "C11" "C23" "C22" "C13"
class(zfish_snATAC_seq_pk_mtrx)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
class(zfish_mta_data)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(zfish_snATAC_seq_pk_mtrx)
## [1] 370058  50637
zfish_snATAC_seq_pk_mtrx[1:10, 1:2]
## 10 x 2 sparse Matrix of class "dgCMatrix"
##                  24hpf_1#24hpf_1_BC00224_N06 24hpf_1#24hpf_1_BC00014_N03
## chr1:5232-5732                             .                           .
## chr1:5787-6287                             .                           .
## chr1:10088-10588                           .                           .
## chr1:10991-11491                           .                           .
## chr1:11895-12395                           .                           2
## chr1:12474-12974                           1                           .
## chr1:14016-14516                           .                           .
## chr1:14704-15204                           1                           .
## chr1:16672-17172                           3                           .
## chr1:18404-18904                           3                           .
zfish_mta_data

For further analysis in this ATAC-Seq-QC to determine whether this data set can be used to create a model which predicts chromatin accessibility based on genomic sequence alone, I will create a seurat object with the code below.

zfish_atac=CreateSeuratObject(counts = zfish_snATAC_seq_pk_mtrx, assay = "atac", meta.data = zfish_mta_data) #combines matrix and meta data into a seurat object
## Warning: Setting row names on a tibble is deprecated.

Here is an overview of the seurat object, obtained with the glimpse function.

glimpse(zfish_atac)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ atac:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
##   ..@ meta.data   :'data.frame': 50637 obs. of  27 variables:
##   .. ..$ orig.ident       : Factor w/ 7 levels "10hpf","12hpf",..: 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ nCount_atac      : num [1:50637] 30862 26934 24272 24893 29109 ...
##   .. ..$ nFeature_atac    : int [1:50637] 15150 13872 13368 11756 13537 11722 12082 12192 9450 11516 ...
##   .. ..$ ...1             : chr [1:50637] "3hpf_1#3hpf_1_merge_BC0443_N27" "3hpf_1#3hpf_1_merge_BC0069_N07" "3hpf_1#3hpf_1_merge_BC0033_N05" "3hpf_1#3hpf_1_merge_BC0028_N08" ...
##   .. ..$ Sample           : chr [1:50637] "3hpf_1" "3hpf_1" "3hpf_1" "3hpf_1" ...
##   .. ..$ TSSEnrichment    : num [1:50637] 6.5 4.21 8.49 5.02 6.57 ...
##   .. ..$ ReadsInTSS       : num [1:50637] 1452 829 1171 684 1337 ...
##   .. ..$ ReadsInPromoter  : num [1:50637] 7341 4887 5702 3907 6664 ...
##   .. ..$ ReadsInBlacklist : num [1:50637] 0 0 0 0 0 0 0 0 0 0 ...
##   .. ..$ PromoterRatio    : num [1:50637] 0.0531 0.0356 0.0488 0.0359 0.0616 ...
##   .. ..$ PassQC           : num [1:50637] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ NucleosomeRatio  : num [1:50637] 1.56 1.36 1.3 1.41 1.35 ...
##   .. ..$ nMultiFrags      : num [1:50637] 16414 15982 11720 12508 10532 ...
##   .. ..$ nMonoFrags       : num [1:50637] 27022 29078 25398 22610 23029 ...
##   .. ..$ nFrags           : num [1:50637] 69102 68665 58378 54390 54053 ...
##   .. ..$ nDiFrags         : num [1:50637] 25666 23605 21260 19272 20492 ...
##   .. ..$ BlacklistRatio   : num [1:50637] 0 0 0 0 0 0 0 0 0 0 ...
##   .. ..$ stage            : chr [1:50637] "3hpf" "3hpf" "3hpf" "3hpf" ...
##   .. ..$ Clusters         : chr [1:50637] "C3" "C3" "C3" "C3" ...
##   .. ..$ ReadsInPeaks     : num [1:50637] 47511 42150 38848 33031 38019 ...
##   .. ..$ FRIP             : num [1:50637] 0.344 0.307 0.333 0.304 0.352 ...
##   .. ..$ celltype         : chr [1:50637] "blastomere" "blastomere" "blastomere" "blastomere" ...
##   .. ..$ predictedCell    : chr [1:50637] "6h_3 CELL4645_N1 _" "3h1_CELL1337_N1_3h1" "3h1_CELL1337_N1_3h1" "3h1_CELL1197_N1_3h1" ...
##   .. ..$ predictedGroup   : chr [1:50637] "margin" "blastomere" "blastomere" "blastomere" ...
##   .. ..$ predictedScore   : num [1:50637] 0.605 0.982 0.986 0.576 0.598 ...
##   .. ..$ DoubletScore     : num [1:50637] 323 169 323 140 323 ...
##   .. ..$ DoubletEnrichment: num [1:50637] 15.3 7.9 16.6 7.1 24.3 3.4 16.2 2.3 7 5.8 ...
##   ..@ active.assay: chr "atac"
##   ..@ active.ident: Factor w/ 7 levels "10hpf","12hpf",..: 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..- attr(*, "names")= chr [1:50637] "24hpf_1#24hpf_1_BC00224_N06" "24hpf_1#24hpf_1_BC00014_N03" "24hpf_1#24hpf_1_BC00075_N02" "24hpf_1#24hpf_1_BC00171_N03" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "SeuratProject"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 5 0 2
##   ..@ commands    : list()
##   ..@ tools       : list()

2. Data transformation

Next the data was transformed. Overall, it was pseudo-bulked,log-normalized manually and saved as an RDS file for later use. To generate the matrix psd.bulk.zfish_atac, which contains all un-normalized counts per cell type, I pseudo-bulked the count data from the zfish_atac object by setting the arguments to: return.seurat=F and no specified normalization method, which is the default of this function.

#pseudobulking

psd.bulk.zfish_atac= AggregateExpression(zfish_atac, group.by = "celltype") 

#output

glimpse(psd.bulk.zfish_atac)
## List of 1
##  $ atac:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:8324533] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..@ p       : int [1:24] 0 369694 739309 1109350 1410735 1780791 2066041 2435528 2799985 3170031 ...
##   .. ..@ Dim     : int [1:2] 370058 23
##   .. ..@ Dimnames:List of 2
##   .. ..@ x       : num [1:8324533] 142 156 179 224 290 309 242 238 274 265 ...
##   .. ..@ factors : list()

After that, this matrix was normalized manually, by dividing the sum of the counts by the number of cells in each celltype then, by multiplying that number by a 1000 for the average in 1000 cells and lastly by transforming the matrix with ln(x+1).

To achieve this, the number of cells of each cell type had to be extracted first.

zfish_atac@meta.data %>% 
  group_by(celltype) %>% 
  tally() %>% 
  pivot_wider(names_from = celltype, values_from = n) %>% 
  as.matrix() -> n_cells
n_cells
##       EVL  UND YSL YSL/presumptive endoderm anterior/posterior axis blastomere
## [1,] 1678 2408 633                      803                    1203       4676
##      central nervous system digestive system epiblast erythroid lineage cell
## [1,]                   3672              271     8007                    279
##      forebrain hypoblast immature eye integument lateral plate mesoderm
## [1,]      1115      4024         1374       1336                    944
##      mesenchyme cell musculature system neural crest neural keel
## [1,]            2764               1355         1177        3525
##      neural stem cell periderm/epidermis primary neuron segmental plate
## [1,]             4901                992           1598            1902

Then, I employed the following loop to extract the correct number of cells of each celltype to divide the counts of the matrix containing the pseudo-bulk with.

##Dividing the counts by the number of cells in each cell type

psd.bulk.zfish_atac$atac[1:5,1:5] #input
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                  anterior/posterior axis blastomere central nervous system
## chr1:5232-5732                       142        230                    216
## chr1:5787-6287                       156        231                    238
## chr1:10088-10588                     179        266                    315
## chr1:10991-11491                     224        337                    400
## chr1:11895-12395                     290        358                    392
##                  digestive system epiblast
## chr1:5232-5732                 15      414
## chr1:5787-6287                 15      505
## chr1:10088-10588               18      617
## chr1:10991-11491               30      870
## chr1:11895-12395               40      890
for (i in seq_along(colnames(n_cells))) {
  ct <- colnames(n_cells)[i]  #extracting cell type names
  
#loop along cell type names in the pseudo bulk matrix
  for (j in seq_along(colnames(psd.bulk.zfish_atac$atac))) {
    atc <- colnames(psd.bulk.zfish_atac$atac)[j]  
    
#if the colnames match divide by the number of cells
    if (atc == ct) {
      psd.bulk.zfish_atac$atac[,j] <- (psd.bulk.zfish_atac$atac[, j] / n_cells[, i])*1000
      break  # stop if everything is matched 
    }
  }
}

psd.bulk.zfish_atac$atac[1:5,1:5] #output
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                  anterior/posterior axis blastomere central nervous system
## chr1:5232-5732                  118.0382   49.18734               58.82353
## chr1:5787-6287                  129.6758   49.40120               64.81481
## chr1:10088-10588                148.7947   56.88623               85.78431
## chr1:10991-11491                186.2012   72.07015              108.93246
## chr1:11895-12395                241.0640   76.56116              106.75381
##                  digestive system  epiblast
## chr1:5232-5732           55.35055  51.70476
## chr1:5787-6287           55.35055  63.06981
## chr1:10088-10588         66.42066  77.05757
## chr1:10991-11491        110.70111 108.65493
## chr1:11895-12395        147.60148 111.15274

After that, the matrix was ln(1+x) transformed in the manner seen below

psd.bulk.zfish_atac=log(psd.bulk.zfish_atac$atac+1)

psd.bulk.zfish_atac[1:10,1:5]
## 10 x 5 Matrix of class "dgeMatrix"
##                  anterior/posterior axis blastomere central nervous system
## chr1:5232-5732                  4.779445   3.915763               4.091399
## chr1:5787-6287                  4.872720   3.920015               4.186845
## chr1:10088-10588                5.009266   4.058479               4.463426
## chr1:10991-11491                5.232184   4.291420               4.699866
## chr1:11895-12395                5.489202   4.351067               4.679849
## chr1:12474-12974                5.552408   4.325938               4.963941
## chr1:14016-14516                5.309078   3.977721               4.838441
## chr1:14704-15204                5.292494   3.822018               5.051321
## chr1:16672-17172                5.432691   4.934031               4.967738
## chr1:18404-18904                5.399441   4.641038               5.153946
##                  digestive system epiblast
## chr1:5232-5732           4.031592 3.964706
## chr1:5787-6287           4.031592 4.159973
## chr1:10088-10588         4.210952 4.357447
## chr1:10991-11491         4.715827 4.697338
## chr1:11895-12395         5.001268 4.719862
## chr1:12474-12974         5.543663 4.688185
## chr1:14016-14516         4.574102 4.496163
## chr1:14704-15204         5.140152 4.815374
## chr1:16672-17172         5.073119 5.091346
## chr1:18404-18904         5.223065 5.041749

Lastly, the transformed matrix was saved as a RDS file for further analysis.

#save as file

saveRDS(psd.bulk.zfish_atac, file = "Data/psd.bulk.zfish_atac.lg1x.rds")

Session Info

Information about the coding environment can be found below.

sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Signac_1.14.0      Seurat_5.2.1       SeuratObject_5.0.2 sp_2.2-0          
##  [5] broom_1.0.8        binom_1.1-1.1      lubridate_1.9.4    forcats_1.0.0     
##  [9] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.4        readr_2.1.5       
## [13] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_1.8.9         
##   [4] magrittr_2.0.3          spatstat.utils_3.1-2    farver_2.1.2           
##   [7] rmarkdown_2.29          zlibbioc_1.52.0         vctrs_0.6.5            
##  [10] ROCR_1.0-11             Rsamtools_2.22.0        spatstat.explore_3.3-4 
##  [13] RcppRoll_0.3.1          htmltools_0.5.8.1       sass_0.4.10            
##  [16] sctransform_0.4.1       parallelly_1.43.0       KernSmooth_2.23-26     
##  [19] bslib_0.9.0             htmlwidgets_1.6.4       ica_1.0-3              
##  [22] plyr_1.8.9              plotly_4.10.4           zoo_1.8-12             
##  [25] cachem_1.1.0            igraph_2.1.4            mime_0.12              
##  [28] lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.7-1           
##  [31] R6_2.6.1                fastmap_1.2.0           GenomeInfoDbData_1.2.13
##  [34] fitdistrplus_1.2-2      future_1.40.0           shiny_1.10.0           
##  [37] digest_0.6.37           colorspace_2.1-1        S4Vectors_0.44.0       
##  [40] patchwork_1.3.0         tensor_1.5              RSpectra_0.16-2        
##  [43] irlba_2.3.5.1           GenomicRanges_1.58.0    progressr_0.15.1       
##  [46] spatstat.sparse_3.1-0   timechange_0.3.0        httr_1.4.7             
##  [49] polyclip_1.10-7         abind_1.4-8             compiler_4.4.2         
##  [52] bit64_4.6.0-1           withr_3.0.2             backports_1.5.0        
##  [55] BiocParallel_1.40.2     fastDummies_1.7.5       MASS_7.3-64            
##  [58] tools_4.4.2             lmtest_0.9-40           httpuv_1.6.15          
##  [61] future.apply_1.11.3     goftest_1.2-3           glue_1.8.0             
##  [64] nlme_3.1-167            promises_1.3.2          grid_4.4.2             
##  [67] Rtsne_0.17              cluster_2.1.8           reshape2_1.4.4         
##  [70] generics_0.1.3          gtable_0.3.6            spatstat.data_3.1-6    
##  [73] tzdb_0.4.0              data.table_1.16.4       hms_1.1.3              
##  [76] XVector_0.46.0          BiocGenerics_0.52.0     spatstat.geom_3.3-5    
##  [79] RcppAnnoy_0.0.22        ggrepel_0.9.6           RANN_2.6.2             
##  [82] pillar_1.10.2           vroom_1.6.5             spam_2.11-1            
##  [85] RcppHNSW_0.6.0          later_1.4.1             splines_4.4.2          
##  [88] lattice_0.22-6          bit_4.6.0               survival_3.8-3         
##  [91] deldir_2.0-4            tidyselect_1.2.1        Biostrings_2.74.1      
##  [94] miniUI_0.1.1.1          pbapply_1.7-2           knitr_1.50             
##  [97] gridExtra_2.3           IRanges_2.40.1          scattermore_1.2        
## [100] stats4_4.4.2            xfun_0.52               matrixStats_1.5.0      
## [103] UCSC.utils_1.2.0        stringi_1.8.4           lazyeval_0.2.2         
## [106] yaml_2.3.10             evaluate_1.0.3          codetools_0.2-20       
## [109] cli_3.6.1               uwot_0.2.2              xtable_1.8-4           
## [112] reticulate_1.40.0       munsell_0.5.1           jquerylib_0.1.4        
## [115] Rcpp_1.0.14             GenomeInfoDb_1.42.3     globals_0.17.0         
## [118] spatstat.random_3.3-2   png_0.1-8               spatstat.univar_3.1-1  
## [121] parallel_4.4.2          dotCall64_1.2           bitops_1.0-9           
## [124] listenv_0.9.1           viridisLite_0.4.2       scales_1.3.0           
## [127] ggridges_0.5.6          crayon_1.5.3            rlang_1.1.4            
## [130] fastmatch_1.1-6         cowplot_1.1.3